IST707 HW1: Data Preparation and Association Rule Mining

Part 0: Getting Started

Load packages and data

In [1]:
# import necessary libraries

import pandas as pd
from pandas import DataFrame, Series
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import apyori as ap 
from apyori import apriori #Apriori Algorithm
import mlxtend as ml
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from mlxtend.preprocessing import TransactionEncoder
import arules

# data formating 

pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows',None)
pd.set_option('max_colwidth',100)
In [2]:
# read data into a pandas dataframe

df = pd.read_csv("employee_attrition.csv",sep=',')

Reading and Displaying Data

  • Number of Rows and Columns: 1176 rows and 35 columns
  • Column Labels: Each column label is unique
  • Empty Column: All columns are non-null
  • Type of Values: float64(7), int64(19), object(9)
In [3]:
df.head(3)
Out[3]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EmployeeNumber EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel JobRole JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate NumCompaniesWorked Over18 OverTime PercentSalaryHike PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
0 30 No Travel_Rarely 1358 Sales 16.0 1 Life Sciences 1 1479 4 Male 96 3 2.0 Sales Executive 3 Married 5301 2939 8 Y No 15.0 3.0 3.0 80 2 4.0 2 2 2 1 2.0 2
1 52 No Travel_Rarely 1325 Research & Development 11.0 4 Life Sciences 1 813 4 Female 82 3 2.0 Laboratory Technician 3 Married 3149 21821 8 Y No 20.0 4.0 2.0 80 1 9.0 3 3 5 2 1.0 4
2 42 No Travel_Rarely 462 Sales 14.0 2 Medical 1 936 3 Female 68 2 2.0 Sales Executive 3 Single 6244 7824 7 Y No 17.0 3.0 1.0 80 0 10.0 6 3 5 4 0.0 3
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1176 entries, 0 to 1175
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Age                       1176 non-null   int64  
 1   Attrition                 1176 non-null   object 
 2   BusinessTravel            1176 non-null   object 
 3   DailyRate                 1176 non-null   int64  
 4   Department                1176 non-null   object 
 5   DistanceFromHome          1174 non-null   float64
 6   Education                 1176 non-null   int64  
 7   EducationField            1176 non-null   object 
 8   EmployeeCount             1176 non-null   int64  
 9   EmployeeNumber            1176 non-null   int64  
 10  EnvironmentSatisfaction   1176 non-null   int64  
 11  Gender                    1175 non-null   object 
 12  HourlyRate                1176 non-null   int64  
 13  JobInvolvement            1176 non-null   int64  
 14  JobLevel                  1175 non-null   float64
 15  JobRole                   1176 non-null   object 
 16  JobSatisfaction           1176 non-null   int64  
 17  MaritalStatus             1176 non-null   object 
 18  MonthlyIncome             1176 non-null   int64  
 19  MonthlyRate               1176 non-null   int64  
 20  NumCompaniesWorked        1176 non-null   int64  
 21  Over18                    1176 non-null   object 
 22  OverTime                  1175 non-null   object 
 23  PercentSalaryHike         1175 non-null   float64
 24  PerformanceRating         1175 non-null   float64
 25  RelationshipSatisfaction  1175 non-null   float64
 26  StandardHours             1176 non-null   int64  
 27  StockOptionLevel          1176 non-null   int64  
 28  TotalWorkingYears         1174 non-null   float64
 29  TrainingTimesLastYear     1176 non-null   int64  
 30  WorkLifeBalance           1176 non-null   int64  
 31  YearsAtCompany            1176 non-null   int64  
 32  YearsInCurrentRole        1176 non-null   int64  
 33  YearsSinceLastPromotion   1175 non-null   float64
 34  YearsWithCurrManager      1176 non-null   int64  
dtypes: float64(7), int64(19), object(9)
memory usage: 321.7+ KB

Part 1: Data Understanding and Data Preparation

  • Exploratory data analysis (EDA): derive descriptive statistics and apply data visualization to check for interesting data patterns

1.1 Derive Descriptive Statistics

Statistical summary of columns interpreted as numbers

  • From the count column, we can see that there are missing values
  • From the std column, we find two columns with no variations
In [5]:
df.describe().T
Out[5]:
count mean std min 25% 50% 75% max
Age 1176.0 36.960034 9.071964 18.0 30.00 36.0 43.00 60.0
DailyRate 1176.0 800.386905 404.709216 102.0 461.75 796.0 1162.00 1499.0
DistanceFromHome 1174.0 9.495741 10.251379 1.0 2.00 7.0 14.00 224.0
Education 1176.0 2.895408 1.025685 1.0 2.00 3.0 4.00 5.0
EmployeeCount 1176.0 1.000000 0.000000 1.0 1.00 1.0 1.00 1.0
EmployeeNumber 1176.0 1036.430272 600.681874 1.0 499.75 1032.5 1574.50 2068.0
EnvironmentSatisfaction 1176.0 2.704932 1.094128 1.0 2.00 3.0 4.00 4.0
HourlyRate 1176.0 65.822279 20.292035 30.0 48.00 66.0 83.00 100.0
JobInvolvement 1176.0 2.740646 0.719277 1.0 2.00 3.0 3.00 4.0
JobLevel 1175.0 2.068936 1.109590 1.0 1.00 2.0 3.00 5.0
JobSatisfaction 1176.0 2.710034 1.113217 1.0 2.00 3.0 4.00 4.0
MonthlyIncome 1176.0 6525.534014 4707.507472 1009.0 2954.50 4950.5 8354.50 19973.0
MonthlyRate 1176.0 14467.688776 7107.658362 2094.0 8275.00 14488.0 20627.25 26999.0
NumCompaniesWorked 1176.0 2.709184 2.476191 0.0 1.00 2.0 4.00 9.0
PercentSalaryHike 1175.0 15.295319 3.698902 11.0 12.00 14.0 18.00 25.0
PerformanceRating 1175.0 3.162553 0.369115 3.0 3.00 3.0 3.00 4.0
RelationshipSatisfaction 1175.0 2.718298 1.084095 1.0 2.00 3.0 4.00 4.0
StandardHours 1176.0 80.000000 0.000000 80.0 80.00 80.0 80.00 80.0
StockOptionLevel 1176.0 0.795918 0.847033 0.0 0.00 1.0 1.00 3.0
TotalWorkingYears 1174.0 11.402896 8.368864 0.0 6.00 10.0 15.00 114.0
TrainingTimesLastYear 1176.0 2.809524 1.307697 0.0 2.00 3.0 3.00 6.0
WorkLifeBalance 1176.0 2.747449 0.691586 1.0 2.00 3.0 3.00 4.0
YearsAtCompany 1176.0 6.918367 6.107140 0.0 3.00 5.0 9.00 40.0
YearsInCurrentRole 1176.0 4.151361 3.501618 0.0 2.00 3.0 7.00 18.0
YearsSinceLastPromotion 1175.0 2.125106 3.183567 0.0 0.00 1.0 2.00 15.0
YearsWithCurrManager 1176.0 4.242347 7.210390 0.0 2.00 3.0 7.00 219.0

Statistical summary for object type columns

  • Over18 column has only 1 unique value. This column should be deleted.
In [6]:
col_obj=[x for x in df.columns.to_list() if str(df[x].dtype) == 'object']
df[col_obj].describe().T
Out[6]:
count unique top freq
Attrition 1176 2 No 991
BusinessTravel 1176 3 Travel_Rarely 839
Department 1176 3 Research & Development 764
EducationField 1176 6 Life Sciences 477
Gender 1175 2 Male 693
JobRole 1176 9 Sales Executive 263
MaritalStatus 1176 3 Married 545
Over18 1176 1 Y 1176
OverTime 1175 2 No 838
In [7]:
for col in col_obj:
    print(df[col].value_counts())
    print()
No     991
Yes    185
Name: Attrition, dtype: int64

Travel_Rarely        839
Travel_Frequently    227
Non-Travel           110
Name: BusinessTravel, dtype: int64

Research & Development    764
Sales                     358
Human Resources            54
Name: Department, dtype: int64

Life Sciences       477
Medical             381
Marketing           127
Technical Degree     97
Other                69
Human Resources      25
Name: EducationField, dtype: int64

Male      693
Female    482
Name: Gender, dtype: int64

Sales Executive              263
Research Scientist           220
Laboratory Technician        209
Manufacturing Director       122
Healthcare Representative    108
Manager                       79
Sales Representative          67
Research Director             65
Human Resources               43
Name: JobRole, dtype: int64

Married     545
Single      365
Divorced    266
Name: MaritalStatus, dtype: int64

Y    1176
Name: Over18, dtype: int64

No     838
Yes    337
Name: OverTime, dtype: int64

1.2 Apply Data Visualization to Check for Data Patterns

Correlation among numerical features (Heatmap)

  • The correlation chart indicates that attributes such as age, job satisfaction, JobInvolvement, and stock optional level are postively correlated.
In [8]:
corr = df[['Age', 'DailyRate', 'DistanceFromHome', 'Education', 'EmployeeCount',
       'EmployeeNumber', 'EnvironmentSatisfaction', 'HourlyRate',
       'JobInvolvement', 'JobLevel', 'JobSatisfaction', 'MonthlyIncome',
       'MonthlyRate', 'NumCompaniesWorked', 'PercentSalaryHike',
       'PerformanceRating', 'RelationshipSatisfaction', 'StandardHours',
       'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
       'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole',
       'YearsSinceLastPromotion', 'YearsWithCurrManager']].corr()

sns.heatmap(corr, cmap="YlGnBu")
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc9c67735d0>

Distribution of Each Categorical Variable (Pie Chart)

In [9]:
import plotly.express as px
for col in col_obj:
    figure = px.pie(df,df[col],title = str(col))
    figure.show()

Part 2: Data Preprocessing, Cleaning and Transformation

  • Data preprocessing, cleaning, transformation: identify potential data quality issues and properly address those issues

2.1 Formating Categorical Attributes

In [10]:
# stripping columns 
df_obj = df.select_dtypes(['object'])
df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())

2.2 Duplicates

In [11]:
if df.duplicated().sum()==0:
    print('No duplicated rows')
No duplicated rows

2.3 Missing values

  • Percentage of rows with missing values: around 0.9%
  • Since we have very few number of rows that have missing vlues, we can delete the rows with missing values without impacting our Association Rule Mining Algorithm.
  • In cases when the percentage of missing values is large, we can replace missing value with majority value for categorical data and we can replace missing value with median/mean for numeric data.
In [12]:
df.isnull().sum()
Out[12]:
Age                         0
Attrition                   0
BusinessTravel              0
DailyRate                   0
Department                  0
DistanceFromHome            2
Education                   0
EducationField              0
EmployeeCount               0
EmployeeNumber              0
EnvironmentSatisfaction     0
Gender                      1
HourlyRate                  0
JobInvolvement              0
JobLevel                    1
JobRole                     0
JobSatisfaction             0
MaritalStatus               0
MonthlyIncome               0
MonthlyRate                 0
NumCompaniesWorked          0
Over18                      0
OverTime                    1
PercentSalaryHike           1
PerformanceRating           1
RelationshipSatisfaction    1
StandardHours               0
StockOptionLevel            0
TotalWorkingYears           2
TrainingTimesLastYear       0
WorkLifeBalance             0
YearsAtCompany              0
YearsInCurrentRole          0
YearsSinceLastPromotion     1
YearsWithCurrManager        0
dtype: int64
In [13]:
df2 = df.dropna(axis = 0, how = 'any')
print("Number of rows with missing values deleted:", df.shape[0]-df2.shape[0])
Number of rows with missing values deleted: 11

2.4 Features with No Variances/Variations

Delete two features with no variances becuase they don't add value to the analysis.

  1. Column 'EmployeeCount'
  2. Column 'StandardHours'
  3. Column 'Over18'
In [14]:
df2['EmployeeCount'].value_counts()
del df2['EmployeeCount']
In [15]:
df2['StandardHours'].value_counts()
del df2['StandardHours']
In [16]:
df2['Over18'].value_counts()
del df2['Over18']
In [17]:
print("Number of columns deleted:", df.shape[1]-df2.shape[1])
print("Number of columns of df2:", df2.shape[1])
Number of columns deleted: 3
Number of columns of df2: 32

2.5 Features with Outliers

Each column has one extremely large value that needs to be deleted.

  1. 'TotalWorkingYears'
  2. 'YearsWithCurrManager'
  3. 'DistanceFromHome'
In [18]:
print('Maximum value of Total Working Years:', max(df2['TotalWorkingYears']))
print('Maximum value of Years With Manager:', max(df2['YearsWithCurrManager']))
print('Maximum value of Years With Manager:', max(df2['DistanceFromHome']))
Maximum value of Total Working Years: 114.0
Maximum value of Years With Manager: 219
Maximum value of Years With Manager: 224.0
In [19]:
df3 = df2.loc[
    (df2['TotalWorkingYears']!=max(df2['TotalWorkingYears']))
    & (df2["YearsWithCurrManager"]!=max(df2["YearsWithCurrManager"]))
       &(df2['DistanceFromHome'] !=max(df2["DistanceFromHome"]))]
print('Number of columns with outliers deleted:', df2.shape[0]-df3.shape[0])
Number of columns with outliers deleted: 3
In [20]:
fig=plt.figure(figsize=(12,4)) # crate a new figure
ax0=fig.add_subplot(121)
ax0 = sns.boxplot(x=df2["TotalWorkingYears"])
ax0.set_title("With an extreme value",fontsize=10)
ax1=fig.add_subplot(122)
ax1 = sns.boxplot(x=df3["TotalWorkingYears"])
ax1.set_title("Without an extreme value",fontsize=10);
In [21]:
fig=plt.figure(figsize=(12,4)) # crate a new figure
ax0=fig.add_subplot(121)
ax0 = sns.boxplot(x=df2["YearsWithCurrManager"])
ax0.set_title("With an extreme value",fontsize=10)
ax1=fig.add_subplot(122)
ax1 = sns.boxplot(x=df3["YearsWithCurrManager"])
ax1.set_title("Without an extreme value",fontsize=10);
In [22]:
fig=plt.figure(figsize=(12,4)) # crate a new figure
ax0=fig.add_subplot(121)
ax0 = sns.boxplot(x=df2["DistanceFromHome"])
ax0.set_title("With an extreme value",fontsize=10)
ax1=fig.add_subplot(122)
ax1 = sns.boxplot(x=df3["DistanceFromHome"])
ax1.set_title("Without an extreme value",fontsize=10);

2.6 Discretize Numerical Attributes

2.6.1 Number of Numeric Columns

In [23]:
numeric_cols = df3.select_dtypes([np.number]).columns
print("Number of numeric columns",'\n',len(numeric_cols))
Number of numeric columns 
 24

2.6.2 Distribution of Numeric Variables

In [24]:
df[numeric_cols].hist(figsize=(15,20));

2.6.3 Based on the above distribution plot, we can categorize numeric columns into 2 Types.

  • Type 1: Columns that have relatively few unique values¶
  • Type 2: Columns that have a relatively wide range of values

2.6.4 Since Type 1 columns has limited values. We simply change their dtypes from numeric to objects.

  • Education: 5 levels(1,2,3,4,5)
  • JobInvolvement: 4 levels(1,2,3,4)
  • JobLevel: 5 levels(1,2,3,4,5)
  • PerformanceRating: 2 levels(3,4)
  • RelationshipSatisfaction: 4 levels(1,2,3,4)
  • WorkLifeBalance: 4 levels (1,2,3,4)
  • StockOptionLevel: 4 levels (0,1,2,3)
  • JobSatisfaction: 4 levels (1,2,3,4)
  • EnvironmentSatisfaction: 4 levels (1,2,3,4)
  • TrainingTimesLastYear: 7 levels (0,1,2,3,4,5,6)
In [25]:
col_list = ['Education','JobInvolvement','JobLevel','PerformanceRating','RelationshipSatisfaction','WorkLifeBalance','StockOptionLevel','JobSatisfaction','EnvironmentSatisfaction','TrainingTimesLastYear']
for col in col_list:
    print(df3[col].value_counts())
3    447
4    309
2    234
1    135
5     37
Name: Education, dtype: int64
3    685
2    291
4    121
1     65
Name: JobInvolvement, dtype: int64
1.0    428
2.0    427
3.0    171
4.0     80
5.0     56
Name: JobLevel, dtype: int64
3.0    972
4.0    190
Name: PerformanceRating, dtype: int64
3.0    361
4.0    345
2.0    239
1.0    217
Name: RelationshipSatisfaction, dtype: int64
3    725
2    268
4    105
1     64
Name: WorkLifeBalance, dtype: int64
0    494
1    478
2    127
3     63
Name: StockOptionLevel, dtype: int64
4    358
3    351
1    239
2    214
Name: JobSatisfaction, dtype: int64
3    359
4    346
1    230
2    227
Name: EnvironmentSatisfaction, dtype: int64
2    431
3    386
5     97
4     95
6     55
1     51
0     47
Name: TrainingTimesLastYear, dtype: int64
In [26]:
for col in col_list:
    df3.astype({col: 'int32'}).dtypes
    print(col,':',type(col))
Education : <class 'str'>
JobInvolvement : <class 'str'>
JobLevel : <class 'str'>
PerformanceRating : <class 'str'>
RelationshipSatisfaction : <class 'str'>
WorkLifeBalance : <class 'str'>
StockOptionLevel : <class 'str'>
JobSatisfaction : <class 'str'>
EnvironmentSatisfaction : <class 'str'>
TrainingTimesLastYear : <class 'str'>
In [27]:
print("Number of numeric columns transformed to objects",'\n',len(col_list))
Number of numeric columns transformed to objects 
 10

2.6.5 Type2 Columns that have a relatively wide range of values. We process these columns using pd.qcut and pd.cut methods.

Use pd.qcut for

  • MonthlyIncome
  • YearsWithCurrManager
In [28]:
# pd.qcut
# MontlyIncome
df3['MonthlyIncome_grp'] = pd.qcut(df3['MonthlyIncome'],3,labels = ['low_mon_inc','med_mon_inc','high_mon_inc'])
# YearsWithCurrManager
df3['YearsWithCurrManager_grp'] = pd.qcut(df3['YearsWithCurrManager'],3,labels = ['rec_manager','some_time_manager','long_time_manager'])
/Users/jieerchen/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:3: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

/Users/jieerchen/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:5: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Use pd.cut for

  • Age
  • DailyRate
  • DistanceFromHome
  • EmployeeNumber
  • HourlyRate
  • PercentSalaryHike
  • TotalWorkingYears
  • YearsAtCompany
  • YearsInCurrentRole
  • YearsSinceLastPromotion
  • NumCompaniesWorked
In [29]:
# pd.cut 
# NumCompaniesWorked
df3["NumCompaniesWorked_grp"] = pd.cut(df3["NumCompaniesWorked"], 3, labels =['few_comps','mid_comps','many_comps'])
# Age 
df3['Age_grp'] = pd.cut(df3['Age'],4,labels = ['young_age','middle_age','senior_age','old_age'])
# DailyRate
df3['DailyRate_grp'] = pd.cut(df3['DailyRate'],3,labels = ['low_daily_rate','med_daily_rate','high_daily_rate'])
# DistanceFromHome
df3['DistanceFromHome_grp'] = pd.cut(df3['DistanceFromHome'],4,labels = ['walking_dist','short_dist','mid_dist','long_dist'])
# EmployeeNumber
df3['EmployeeNumber_grp'] = pd.cut(df3['EmployeeNumber'],3,labels = ['few_employ','many_employ','a_lot_employ'])
# HourlyRate
df3['HourlyRate_grp'] = pd.cut(df3['HourlyRate'],3,labels = ['low_hr_rate','med_hr_rate','high_hr_rate'])
# PercentSalaryHike
df3['PercentSalaryHike_grp'] = pd.cut(df3['PercentSalaryHike'],3,labels = ['low_per_hike','med_per_hike','high_per_hike'])
# TotalWorkingYears
df3['TotalWorkingYears_grp'] = pd.cut(df3['TotalWorkingYears'],3,labels = ['few_wrok_yrs','many_work_yrs','a_lot_work_yrs'])
# YearsAtCompany
df3['YearsAtCompany_grp'] = pd.cut(df3['YearsAtCompany'],3,labels = ['few_yrs_comp','many_yrs_comp','a_lot_yrs_comp'])
# YearsInCurrentRole
df3['YearsInCurrentRole_grp'] = pd.cut(df3['YearsInCurrentRole'],3,labels = ['few_yrs_role','many_yrs_role','a_lot_yrs_role'])
# YearsSinceLastPromotion
df3['YearsSinceLastPromotion_grp'] = pd.cut(df3['YearsSinceLastPromotion'],3,labels = ['rec_since_promo','some_time_since_promo','long_time_since_promo'])
# MontlyRate
df3['MonthlyRate_grp'] = pd.cut(df3['MonthlyRate'],3,labels = ['low_mon_rate','med_mon_rate','high_mon_rate'])
/Users/jieerchen/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:3: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

/Users/jieerchen/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:5: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

/Users/jieerchen/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:7: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

/Users/jieerchen/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:9: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

/Users/jieerchen/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:11: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

/Users/jieerchen/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:13: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

/Users/jieerchen/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:15: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

/Users/jieerchen/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:17: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

/Users/jieerchen/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:19: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

/Users/jieerchen/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:21: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

/Users/jieerchen/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:23: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

/Users/jieerchen/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:25: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

2.6.6 Drop the Categorical Columns

In [30]:
# list of columns to drop 
col
col_to_drop = list(set(numeric_cols) - set(col_list))
print(col_to_drop)
['YearsSinceLastPromotion', 'MonthlyRate', 'MonthlyIncome', 'YearsInCurrentRole', 'Age', 'YearsAtCompany', 'NumCompaniesWorked', 'DistanceFromHome', 'TotalWorkingYears', 'PercentSalaryHike', 'YearsWithCurrManager', 'EmployeeNumber', 'DailyRate', 'HourlyRate']
In [31]:
# delete columns 
for col in col_to_drop:
    del df3[col]
In [32]:
print(df3.shape)
df3.head()
(1162, 32)
Out[32]:
Attrition BusinessTravel Department Education EducationField EnvironmentSatisfaction Gender JobInvolvement JobLevel JobRole JobSatisfaction MaritalStatus OverTime PerformanceRating RelationshipSatisfaction StockOptionLevel TrainingTimesLastYear WorkLifeBalance MonthlyIncome_grp YearsWithCurrManager_grp NumCompaniesWorked_grp Age_grp DailyRate_grp DistanceFromHome_grp EmployeeNumber_grp HourlyRate_grp PercentSalaryHike_grp TotalWorkingYears_grp YearsAtCompany_grp YearsInCurrentRole_grp YearsSinceLastPromotion_grp MonthlyRate_grp
0 No Travel_Rarely Sales 1 Life Sciences 4 Male 3 2.0 Sales Executive 3 Married No 3.0 3.0 2 2 2 med_mon_inc rec_manager many_comps middle_age high_daily_rate mid_dist a_lot_employ high_hr_rate low_per_hike few_wrok_yrs few_yrs_comp few_yrs_role rec_since_promo low_mon_rate
1 No Travel_Rarely Research & Development 4 Life Sciences 4 Female 3 2.0 Laboratory Technician 3 Married No 4.0 2.0 1 3 3 low_mon_inc some_time_manager many_comps old_age high_daily_rate short_dist many_employ high_hr_rate med_per_hike few_wrok_yrs few_yrs_comp few_yrs_role rec_since_promo high_mon_rate
2 No Travel_Rarely Sales 2 Medical 3 Female 2 2.0 Sales Executive 3 Single No 3.0 1.0 0 6 3 med_mon_inc some_time_manager many_comps senior_age low_daily_rate short_dist many_employ med_hr_rate med_per_hike few_wrok_yrs few_yrs_comp few_yrs_role rec_since_promo low_mon_rate
3 No Non-Travel Research & Development 1 Medical 4 Male 2 4.0 Healthcare Representative 2 Divorced Yes 3.0 4.0 1 3 3 high_mon_inc rec_manager few_comps old_age low_daily_rate walking_dist many_employ low_hr_rate low_per_hike a_lot_work_yrs a_lot_yrs_comp many_yrs_role long_time_since_promo high_mon_rate
4 No Travel_Rarely Research & Development 3 Life Sciences 4 Female 2 3.0 Healthcare Representative 2 Single No 3.0 4.0 0 3 1 high_mon_inc some_time_manager few_comps middle_age med_daily_rate mid_dist a_lot_employ high_hr_rate med_per_hike few_wrok_yrs few_yrs_comp many_yrs_role rec_since_promo high_mon_rate

2.7 Format Dataframe for Later Analysis

In [33]:
df4 = pd.DataFrame({col: str(col)+'=' for col in df3}, 
                index=df3.index) + df3.astype(str)
df4.head()
Out[33]:
Attrition BusinessTravel Department Education EducationField EnvironmentSatisfaction Gender JobInvolvement JobLevel JobRole JobSatisfaction MaritalStatus OverTime PerformanceRating RelationshipSatisfaction StockOptionLevel TrainingTimesLastYear WorkLifeBalance MonthlyIncome_grp YearsWithCurrManager_grp NumCompaniesWorked_grp Age_grp DailyRate_grp DistanceFromHome_grp EmployeeNumber_grp HourlyRate_grp PercentSalaryHike_grp TotalWorkingYears_grp YearsAtCompany_grp YearsInCurrentRole_grp YearsSinceLastPromotion_grp MonthlyRate_grp
0 Attrition=No BusinessTravel=Travel_Rarely Department=Sales Education=1 EducationField=Life Sciences EnvironmentSatisfaction=4 Gender=Male JobInvolvement=3 JobLevel=2.0 JobRole=Sales Executive JobSatisfaction=3 MaritalStatus=Married OverTime=No PerformanceRating=3.0 RelationshipSatisfaction=3.0 StockOptionLevel=2 TrainingTimesLastYear=2 WorkLifeBalance=2 MonthlyIncome_grp=med_mon_inc YearsWithCurrManager_grp=rec_manager NumCompaniesWorked_grp=many_comps Age_grp=middle_age DailyRate_grp=high_daily_rate DistanceFromHome_grp=mid_dist EmployeeNumber_grp=a_lot_employ HourlyRate_grp=high_hr_rate PercentSalaryHike_grp=low_per_hike TotalWorkingYears_grp=few_wrok_yrs YearsAtCompany_grp=few_yrs_comp YearsInCurrentRole_grp=few_yrs_role YearsSinceLastPromotion_grp=rec_since_promo MonthlyRate_grp=low_mon_rate
1 Attrition=No BusinessTravel=Travel_Rarely Department=Research & Development Education=4 EducationField=Life Sciences EnvironmentSatisfaction=4 Gender=Female JobInvolvement=3 JobLevel=2.0 JobRole=Laboratory Technician JobSatisfaction=3 MaritalStatus=Married OverTime=No PerformanceRating=4.0 RelationshipSatisfaction=2.0 StockOptionLevel=1 TrainingTimesLastYear=3 WorkLifeBalance=3 MonthlyIncome_grp=low_mon_inc YearsWithCurrManager_grp=some_time_manager NumCompaniesWorked_grp=many_comps Age_grp=old_age DailyRate_grp=high_daily_rate DistanceFromHome_grp=short_dist EmployeeNumber_grp=many_employ HourlyRate_grp=high_hr_rate PercentSalaryHike_grp=med_per_hike TotalWorkingYears_grp=few_wrok_yrs YearsAtCompany_grp=few_yrs_comp YearsInCurrentRole_grp=few_yrs_role YearsSinceLastPromotion_grp=rec_since_promo MonthlyRate_grp=high_mon_rate
2 Attrition=No BusinessTravel=Travel_Rarely Department=Sales Education=2 EducationField=Medical EnvironmentSatisfaction=3 Gender=Female JobInvolvement=2 JobLevel=2.0 JobRole=Sales Executive JobSatisfaction=3 MaritalStatus=Single OverTime=No PerformanceRating=3.0 RelationshipSatisfaction=1.0 StockOptionLevel=0 TrainingTimesLastYear=6 WorkLifeBalance=3 MonthlyIncome_grp=med_mon_inc YearsWithCurrManager_grp=some_time_manager NumCompaniesWorked_grp=many_comps Age_grp=senior_age DailyRate_grp=low_daily_rate DistanceFromHome_grp=short_dist EmployeeNumber_grp=many_employ HourlyRate_grp=med_hr_rate PercentSalaryHike_grp=med_per_hike TotalWorkingYears_grp=few_wrok_yrs YearsAtCompany_grp=few_yrs_comp YearsInCurrentRole_grp=few_yrs_role YearsSinceLastPromotion_grp=rec_since_promo MonthlyRate_grp=low_mon_rate
3 Attrition=No BusinessTravel=Non-Travel Department=Research & Development Education=1 EducationField=Medical EnvironmentSatisfaction=4 Gender=Male JobInvolvement=2 JobLevel=4.0 JobRole=Healthcare Representative JobSatisfaction=2 MaritalStatus=Divorced OverTime=Yes PerformanceRating=3.0 RelationshipSatisfaction=4.0 StockOptionLevel=1 TrainingTimesLastYear=3 WorkLifeBalance=3 MonthlyIncome_grp=high_mon_inc YearsWithCurrManager_grp=rec_manager NumCompaniesWorked_grp=few_comps Age_grp=old_age DailyRate_grp=low_daily_rate DistanceFromHome_grp=walking_dist EmployeeNumber_grp=many_employ HourlyRate_grp=low_hr_rate PercentSalaryHike_grp=low_per_hike TotalWorkingYears_grp=a_lot_work_yrs YearsAtCompany_grp=a_lot_yrs_comp YearsInCurrentRole_grp=many_yrs_role YearsSinceLastPromotion_grp=long_time_since_promo MonthlyRate_grp=high_mon_rate
4 Attrition=No BusinessTravel=Travel_Rarely Department=Research & Development Education=3 EducationField=Life Sciences EnvironmentSatisfaction=4 Gender=Female JobInvolvement=2 JobLevel=3.0 JobRole=Healthcare Representative JobSatisfaction=2 MaritalStatus=Single OverTime=No PerformanceRating=3.0 RelationshipSatisfaction=4.0 StockOptionLevel=0 TrainingTimesLastYear=3 WorkLifeBalance=1 MonthlyIncome_grp=high_mon_inc YearsWithCurrManager_grp=some_time_manager NumCompaniesWorked_grp=few_comps Age_grp=middle_age DailyRate_grp=med_daily_rate DistanceFromHome_grp=mid_dist EmployeeNumber_grp=a_lot_employ HourlyRate_grp=high_hr_rate PercentSalaryHike_grp=med_per_hike TotalWorkingYears_grp=few_wrok_yrs YearsAtCompany_grp=few_yrs_comp YearsInCurrentRole_grp=many_yrs_role YearsSinceLastPromotion_grp=rec_since_promo MonthlyRate_grp=high_mon_rate

Part 3: Visualize Frequent Itemsets (More EDA)

In [34]:
# use pd.melt methond to pivot the dataframe
melted_data = pd.melt(df4)
frequency = melted_data.groupby(by=['value'])['value'].count().sort_values(ascending=True)
freq_itemset = pd.DataFrame({'item':frequency.index,'frequency':frequency.values})

3.1 the 10 least frequent itemsets

In [35]:
g = sns.barplot(data = freq_itemset.head(10),x='item',y='frequency',color="salmon", saturation=.5)
g.set_xticklabels(g.get_xticklabels(),rotation=90)
plt.show()

3.2 the 10 most frequent itemsets

In [36]:
# Visualize the 10 most frequent itemsets
g = sns.barplot(data = freq_itemset.tail(10),x='item',y='frequency',color="salmon", saturation=.5)
g.set_xticklabels(g.get_xticklabels(),rotation=90)
plt.show()

Part 4: Modeling

4.1 Data processing for the Apriori library

  • The Apriori library requires the dataset to be in the form of a list of lists
In [37]:
records = []
for i in range(0,len(df4)):
    records.append([str(df4.values[i,j]) 
    for j in range(0, len(df4.columns))])
records[0] 
Out[37]:
['Attrition=No',
 'BusinessTravel=Travel_Rarely',
 'Department=Sales',
 'Education=1',
 'EducationField=Life Sciences',
 'EnvironmentSatisfaction=4',
 'Gender=Male',
 'JobInvolvement=3',
 'JobLevel=2.0',
 'JobRole=Sales Executive',
 'JobSatisfaction=3',
 'MaritalStatus=Married',
 'OverTime=No',
 'PerformanceRating=3.0',
 'RelationshipSatisfaction=3.0',
 'StockOptionLevel=2',
 'TrainingTimesLastYear=2',
 'WorkLifeBalance=2',
 'MonthlyIncome_grp=med_mon_inc',
 'YearsWithCurrManager_grp=rec_manager',
 'NumCompaniesWorked_grp=many_comps',
 'Age_grp=middle_age',
 'DailyRate_grp=high_daily_rate',
 'DistanceFromHome_grp=mid_dist',
 'EmployeeNumber_grp=a_lot_employ',
 'HourlyRate_grp=high_hr_rate',
 'PercentSalaryHike_grp=low_per_hike',
 'TotalWorkingYears_grp=few_wrok_yrs',
 'YearsAtCompany_grp=few_yrs_comp',
 'YearsInCurrentRole_grp=few_yrs_role',
 'YearsSinceLastPromotion_grp=rec_since_promo',
 'MonthlyRate_grp=low_mon_rate']

4.2 Run association rule mining algorithm using default settings as a baseline model

In [38]:
te = TransactionEncoder()
te_ary = te.fit(records).transform(records)

# For our convenience, we can turn the encoded array into a pandas DataFrame
df5 = pd.DataFrame(te_ary, columns=te.columns_)
df5.head(3)
Out[38]:
Age_grp=middle_age Age_grp=old_age Age_grp=senior_age Age_grp=young_age Attrition=No Attrition=Yes BusinessTravel=Non-Travel BusinessTravel=Travel_Frequently BusinessTravel=Travel_Rarely DailyRate_grp=high_daily_rate DailyRate_grp=low_daily_rate DailyRate_grp=med_daily_rate Department=Human Resources Department=Research & Development Department=Sales DistanceFromHome_grp=long_dist DistanceFromHome_grp=mid_dist DistanceFromHome_grp=short_dist DistanceFromHome_grp=walking_dist Education=1 Education=2 Education=3 Education=4 Education=5 EducationField=Human Resources EducationField=Life Sciences EducationField=Marketing EducationField=Medical EducationField=Other EducationField=Technical Degree EmployeeNumber_grp=a_lot_employ EmployeeNumber_grp=few_employ EmployeeNumber_grp=many_employ EnvironmentSatisfaction=1 EnvironmentSatisfaction=2 EnvironmentSatisfaction=3 EnvironmentSatisfaction=4 Gender=Female Gender=Male HourlyRate_grp=high_hr_rate HourlyRate_grp=low_hr_rate HourlyRate_grp=med_hr_rate JobInvolvement=1 JobInvolvement=2 JobInvolvement=3 JobInvolvement=4 JobLevel=1.0 JobLevel=2.0 JobLevel=3.0 JobLevel=4.0 JobLevel=5.0 JobRole=Healthcare Representative JobRole=Human Resources JobRole=Laboratory Technician JobRole=Manager JobRole=Manufacturing Director JobRole=Research Director JobRole=Research Scientist JobRole=Sales Executive JobRole=Sales Representative JobSatisfaction=1 JobSatisfaction=2 JobSatisfaction=3 JobSatisfaction=4 MaritalStatus=Divorced MaritalStatus=Married MaritalStatus=Single MonthlyIncome_grp=high_mon_inc MonthlyIncome_grp=low_mon_inc MonthlyIncome_grp=med_mon_inc MonthlyRate_grp=high_mon_rate MonthlyRate_grp=low_mon_rate MonthlyRate_grp=med_mon_rate NumCompaniesWorked_grp=few_comps NumCompaniesWorked_grp=many_comps NumCompaniesWorked_grp=mid_comps OverTime=No OverTime=Yes PercentSalaryHike_grp=high_per_hike PercentSalaryHike_grp=low_per_hike PercentSalaryHike_grp=med_per_hike PerformanceRating=3.0 PerformanceRating=4.0 RelationshipSatisfaction=1.0 RelationshipSatisfaction=2.0 RelationshipSatisfaction=3.0 RelationshipSatisfaction=4.0 StockOptionLevel=0 StockOptionLevel=1 StockOptionLevel=2 StockOptionLevel=3 TotalWorkingYears_grp=a_lot_work_yrs TotalWorkingYears_grp=few_wrok_yrs TotalWorkingYears_grp=many_work_yrs TrainingTimesLastYear=0 TrainingTimesLastYear=1 TrainingTimesLastYear=2 TrainingTimesLastYear=3 TrainingTimesLastYear=4 TrainingTimesLastYear=5 TrainingTimesLastYear=6 WorkLifeBalance=1 WorkLifeBalance=2 WorkLifeBalance=3 WorkLifeBalance=4 YearsAtCompany_grp=a_lot_yrs_comp YearsAtCompany_grp=few_yrs_comp YearsAtCompany_grp=many_yrs_comp YearsInCurrentRole_grp=a_lot_yrs_role YearsInCurrentRole_grp=few_yrs_role YearsInCurrentRole_grp=many_yrs_role YearsSinceLastPromotion_grp=long_time_since_promo YearsSinceLastPromotion_grp=rec_since_promo YearsSinceLastPromotion_grp=some_time_since_promo YearsWithCurrManager_grp=long_time_manager YearsWithCurrManager_grp=rec_manager YearsWithCurrManager_grp=some_time_manager
0 True False False False True False False False True True False False False False True False True False False True False False False False False True False False False False True False False False False False True False True True False False False False True False False True False False False False False False False False False False True False False False True False False True False False False True False True False False True False True False False True False True False False False True False False False True False False True False False False True False False False False False True False False False True False False True False False True False False True False
1 False True False False True False False False True True False False False True False False False True False False False False True False False True False False False False False False True False False False True True False True False False False False True False False True False False False False False True False False False False False False False False True False False True False False True False True False False False True False True False False False True False True False True False False False True False False False True False False False False True False False False False False True False False True False False True False False True False False False True
2 False False True False True False False False True False True False False False True False False True False False True False False False False False False True False False False False True False False True False True False False False True False True False False False True False False False False False False False False False False True False False False True False False False True False False True False True False False True False True False False False True True False True False False False True False False False False True False False False False False False False True False False True False False True False False True False False True False False False True
In [39]:
df5.to_csv('df5.csv',header=True, encoding = 'utf-8',index = False)
In [40]:
frequent_itemsets = apriori(df5,use_colnames=True)
frequent_itemsets.sort_values(by='support',ascending=False)
rules = association_rules(frequent_itemsets, metric="lift")
rules.head()
Out[40]:
antecedents consequents antecedent support consequent support support confidence lift leverage conviction
0 (Attrition=No) (BusinessTravel=Travel_Rarely) 0.842513 0.715146 0.613597 0.728294 1.018385 0.011077 1.048390
1 (BusinessTravel=Travel_Rarely) (Attrition=No) 0.715146 0.842513 0.613597 0.858002 1.018385 0.011077 1.109083
2 (Attrition=No) (Department=Research & Development) 0.842513 0.648881 0.566265 0.672114 1.035805 0.019574 1.070858
3 (Department=Research & Development) (Attrition=No) 0.648881 0.842513 0.566265 0.872679 1.035805 0.019574 1.236930
4 (Attrition=No) (DistanceFromHome_grp=walking_dist) 0.842513 0.574871 0.505164 0.599591 1.043002 0.020827 1.061738

Plot Association Rules for baseline model

In [41]:
b = sns.scatterplot(data=rules,x='support',y='confidence',hue='lift')
plt.show()

4.3 Fine tune the model by experimenting with different algorithm hyper-parameters

  • Supoprt: (Transactions containing (B))/(Total Transactions)
  • Confidence: (Transactions containing both (A and B))/(Transactions containing A)
  • Lift: how much more often the antecedent and consequent of a rule A->C occur together than we would expect if they were statistically independent. (Confidence (A→B))/(Support (B))
In [42]:
frequent_itemsets2 = apriori(df5, min_support=0.6, use_colnames=True)
frequent_itemsets2.sort_values(by='support',ascending=False)
rules2 = association_rules(frequent_itemsets2, metric="lift", min_threshold=1)
rules2[(rules['lift']>1) & (rules['confidence'] > 0.5)].head(10)
/Users/jieerchen/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:4: UserWarning:

Boolean Series key will be reindexed to match DataFrame index.

Out[42]:
antecedents consequents antecedent support consequent support support confidence lift leverage conviction
0 (Attrition=No) (BusinessTravel=Travel_Rarely) 0.842513 0.715146 0.613597 0.728294 1.018385 0.011077 1.048390
1 (BusinessTravel=Travel_Rarely) (Attrition=No) 0.715146 0.842513 0.613597 0.858002 1.018385 0.011077 1.109083
2 (Attrition=No) (OverTime=No) 0.842513 0.713425 0.636833 0.755873 1.059499 0.035763 1.173878
3 (OverTime=No) (Attrition=No) 0.713425 0.842513 0.636833 0.892642 1.059499 0.035763 1.466930
4 (Attrition=No) (PerformanceRating=3.0) 0.842513 0.836489 0.708262 0.840654 1.004979 0.003509 1.026138
5 (PerformanceRating=3.0) (Attrition=No) 0.836489 0.842513 0.708262 0.846708 1.004979 0.003509 1.027365
6 (PerformanceRating=3.0) (BusinessTravel=Travel_Rarely) 0.836489 0.715146 0.603270 0.721193 1.008456 0.005058 1.021689
7 (BusinessTravel=Travel_Rarely) (PerformanceRating=3.0) 0.715146 0.836489 0.603270 0.843562 1.008456 0.005058 1.045214
8 (BusinessTravel=Travel_Rarely) (YearsSinceLastPromotion_grp=rec_since_promo) 0.715146 0.861446 0.617040 0.862816 1.001590 0.000980 1.009987
9 (YearsSinceLastPromotion_grp=rec_since_promo) (BusinessTravel=Travel_Rarely) 0.861446 0.715146 0.617040 0.716284 1.001590 0.000980 1.004009

Plot Association Rules for new model (A comparision of the base model and the new model)

In [43]:
fig=plt.figure(figsize=(12,4)) # crate a new figure
ax0=fig.add_subplot(121)
ax0 = sns.scatterplot(data=rules,x='support',y='confidence',hue='lift')
ax0.set_title("Baseline Model",fontsize=8)
ax1=fig.add_subplot(122)
ax1 = sns.scatterplot(data=rules2,x='support',y='confidence',hue='lift')
ax1.set_title("Parameter Tuned",fontsize=8);

4.3 Discuss how tuning those hyper-parameters could impact the model performance (e.g. overfitting or underfiting).

  • In the Assoication Rule Minning model, we have three hyper-paremeters. They are Support, Confidence, and Lift. These parameters are used to exclude rules in the result that have a support or a confidence lower than the minimum support, minimum confidence ,and lift respectively.
  • Underfitting may occurs if we don't set the values of three hyper-paremeter or set their values at a very low level. Under such situation, our model is going to reutrn us too many results relative to the number of itemsets. It can netither capture the undering pattern of the data nor generalize to new data.
  • Overfitting may occurs if we set the values of three hyper-parameter very high or use a unusual combination of values. Under such situation, our model is going to give us too few results relative to the number of itemsets, which means that our model is too closely fit to a limited set of data points. In other words, the returned combinations we see in that speicfic dataset might not be observed in another similar dataset.

Part 5: Classification with Association Rule Mining

  • Output and present the most interesting and significant rules which could predict “Attrition”
  • print out the top 5 rules which predict those who stay vs. who leave, respectively

5.1 Preparation

Step1: Looking at the distribution of "Attribution".

  • The support for "Attrition=Yes" is around 15.73% while the support for "Attrition=No" is around 84.27%.
  • This indicates that we should choose a relatively high support value for "Attrition=Yes" and a realtively low support value for "Attrition=No" when doing the classification.
In [44]:
df3['Attrition'].value_counts()
print('Support of Attrition=Yes:', 185/(991+185))
print('Support of Attrition=No:', 991/(991+185))
Support of Attrition=Yes: 0.15731292517006804
Support of Attrition=No: 0.842687074829932

Step2: Defining a function to do the classification with Association Rule Mining

In [45]:
def SupervisedApriori(data,consequent,min_supp,min_conf,min_lift):
    frequent_itemsets = apriori(data, min_support=min_supp, use_colnames=True)
    rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=min_conf)
    rules = rules[rules['lift'] > min_lift]
    sup_rules = pd.DataFrame()
    for i in consequent:
        df = rules[rules['consequents'] == {i}]
        sup_rules = sup_rules.append(df,ignore_index = True)
    return(sup_rules)

5.2 The top 5 rules which predict those who stay

Step1: Coming up with the a list of rules that have "Attrition=No" as consequents

In [46]:
att_no = SupervisedApriori(df5,consequent = ['Attrition=No'],min_supp=0.5, min_conf=0.5, min_lift=1)
att_no = att_no.sort_values(by='lift',ascending = False)
att_no.head()
Out[46]:
antecedents consequents antecedent support consequent support support confidence lift leverage conviction
5 (OverTime=No) (Attrition=No) 0.713425 0.842513 0.636833 0.892642 1.059499 0.035763 1.466930
14 (PerformanceRating=3.0, OverTime=No) (Attrition=No) 0.596386 0.842513 0.531842 0.891775 1.058470 0.029379 1.455181
16 (OverTime=No, YearsSinceLastPromotion_grp=rec_since_promo) (Attrition=No) 0.612737 0.842513 0.545611 0.890449 1.056897 0.029372 1.437574
15 (YearsAtCompany_grp=few_yrs_comp, OverTime=No) (Attrition=No) 0.635112 0.842513 0.564544 0.888889 1.055045 0.029454 1.417384
19 (YearsAtCompany_grp=few_yrs_comp, OverTime=No, YearsSinceLastPromotion_grp=rec_since_promo) (Attrition=No) 0.575731 0.842513 0.509466 0.884903 1.050314 0.024405 1.368297

Step2: removing redundant rules: in the presence of more general rules, the specific rules are considereed as redundant and should be removed.

In [47]:
att_no_2 = att_no[:]
alist = []

for i in range(0,20):
    for j in range(1,20):
        if set(att_no_2['antecedents'][i]).issubset(set(att_no_2['antecedents'][j])) and att_no_2['antecedents'][i]!= att_no_2['antecedents'][j]:
            alist.append(i)

alist = set(alist)
alist = list(alist)
print(alist)

att_no_2.drop(df.index[alist],inplace=True)
[0, 4, 5, 6, 7, 15, 16]
/Users/jieerchen/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py:3997: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

In [48]:
att_no_2.head(5)
Out[48]:
antecedents consequents antecedent support consequent support support confidence lift leverage conviction
14 (PerformanceRating=3.0, OverTime=No) (Attrition=No) 0.596386 0.842513 0.531842 0.891775 1.058470 0.029379 1.455181
19 (YearsAtCompany_grp=few_yrs_comp, OverTime=No, YearsSinceLastPromotion_grp=rec_since_promo) (Attrition=No) 0.575731 0.842513 0.509466 0.884903 1.050314 0.024405 1.368297
2 (DistanceFromHome_grp=walking_dist) (Attrition=No) 0.574871 0.842513 0.505164 0.878743 1.043002 0.020827 1.298782
1 (Department=Research & Development) (Attrition=No) 0.648881 0.842513 0.566265 0.872679 1.035805 0.019574 1.236930
8 (WorkLifeBalance=3) (Attrition=No) 0.623924 0.842513 0.537866 0.862069 1.023212 0.012201 1.141781

5.3 The top 5 rules which predict those who leaves

Step1: Coming up with the a list of rules that have "Attrition=No" as consequents

In [49]:
att_yes = SupervisedApriori(df5,consequent = ['Attrition=Yes'],min_supp=0.1, min_conf=0.0, min_lift=0.0)
att_yes = att_yes.sort_values(by='lift',ascending=False)
att_yes.head()
Out[49]:
antecedents consequents antecedent support consequent support support confidence lift leverage conviction
2 (StockOptionLevel=0) (Attrition=Yes) 0.425129 0.157487 0.102410 0.240891 1.529590 0.035457 1.109870
27 (YearsAtCompany_grp=few_yrs_comp, YearsWithCurrManager_grp=rec_manager, YearsSinceLastPromotion_... (Attrition=Yes) 0.467298 0.157487 0.100688 0.215470 1.368173 0.027095 1.073907
19 (YearsWithCurrManager_grp=rec_manager, YearsSinceLastPromotion_grp=rec_since_promo) (Attrition=Yes) 0.469019 0.157487 0.100688 0.214679 1.363152 0.026824 1.072826
17 (YearsAtCompany_grp=few_yrs_comp, YearsWithCurrManager_grp=rec_manager) (Attrition=Yes) 0.475904 0.157487 0.101549 0.213382 1.354915 0.026600 1.071057
7 (YearsWithCurrManager_grp=rec_manager) (Attrition=Yes) 0.480207 0.157487 0.101549 0.211470 1.342774 0.025923 1.068460

Step2: removing redundant rules: in the presence of more general rules, the specific rules are considereed as redundant and should be removed.

In [50]:
att_yes_2 = att_yes[:]
alist2 = []

for i in range(0,29):
    for j in range(1,29):
        if set(att_yes_2['antecedents'][i]).issubset(set(att_yes_2['antecedents'][j])) and att_yes_2['antecedents'][i]!= att_yes_2['antecedents'][j]:
            alist2.append(i)

alist2 = set(alist2)
alist2 = list(alist2)
print(alist2)

att_yes_2.drop(df.index[alist2],inplace=True)
[1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 23, 24, 25, 26]
/Users/jieerchen/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py:3997: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

In [51]:
att_yes_2.head(5)
Out[51]:
antecedents consequents antecedent support consequent support support confidence lift leverage conviction
2 (StockOptionLevel=0) (Attrition=Yes) 0.425129 0.157487 0.102410 0.240891 1.529590 0.035457 1.109870
27 (YearsAtCompany_grp=few_yrs_comp, YearsWithCurrManager_grp=rec_manager, YearsSinceLastPromotion_... (Attrition=Yes) 0.467298 0.157487 0.100688 0.215470 1.368173 0.027095 1.073907
28 (YearsInCurrentRole_grp=few_yrs_role, YearsAtCompany_grp=few_yrs_comp, TotalWorkingYears_grp=few... (Attrition=Yes) 0.522375 0.157487 0.110155 0.210873 1.338987 0.027888 1.067652
21 (YearsAtCompany_grp=few_yrs_comp, PerformanceRating=3.0, YearsInCurrentRole_grp=few_yrs_role) (Attrition=Yes) 0.550775 0.157487 0.100688 0.182813 1.160809 0.013949 1.030991
20 (YearsAtCompany_grp=few_yrs_comp, PerformanceRating=3.0, TotalWorkingYears_grp=few_wrok_yrs) (Attrition=Yes) 0.597246 0.157487 0.105852 0.177233 1.125384 0.011793 1.024000
In [52]:
# show the full antecedents
print('Rule 2:', att_yes_2['antecedents'][27],'\n')
print('Rule 3:', att_yes_2['antecedents'][28],'\n')
Rule 2: frozenset({'YearsAtCompany_grp=few_yrs_comp', 'YearsWithCurrManager_grp=rec_manager', 'YearsSinceLastPromotion_grp=rec_since_promo'}) 

Rule 3: frozenset({'YearsInCurrentRole_grp=few_yrs_role', 'YearsAtCompany_grp=few_yrs_comp', 'TotalWorkingYears_grp=few_wrok_yrs', 'YearsSinceLastPromotion_grp=rec_since_promo'}) 

Part 6: Model Interpretation

  • Provide interpretations of the above chosen association rules and also discuss why you consider them interesting and significant.

6.1 Discovering interesting and significant rules using 'Lift'

  • In IST687, I learned that a more interesting rule could be a more useful rule (significant) becuase it is more novel and unexpected.
  • Lift is one such measure to tell how interesting and significant a rule is. The lift value of an association rule is the ratio of the confidence of the rule and the expected confidence of the rule. Intuitively speaking, lift measures how much better the rule is at predicting presence of a consequent compared to just relying on the raw probability of the consequent in the dataset.
  • Therefore, in part 4, to find out interesting and significant rules, I used 'Lift' to sort the rules at the desending order. After removing the duplicated rules, I selected the top 5 rules for "Attrition=Yes" and "Attrition=No" respeectively.

6.2 Those who stay

Rule 1: (PerformanceRating=3.0, OverTime=No) --> (Attrition=No)

  • This group of employees has relatively good job performance and doesn't work overtime, indicating that these employees have low work stress and a relatively good work-life balance.
  • This rule is intesting becuaes sometimes we tend to think that, after being capable of doing current job, people desires a new job that is more challenging than the current one.

Rule 2: (YearsAtCompany_grp=few_yrs_comp, YearsSinceLastPromotion_grp=rec_since_promo, OverTime=No) --> (Attrition=No)

  • This group of employees has been in their company for a few years and have been promoted recently. They don't work overtime either. Therefore, we can conclude that this group of employees is very capable of doing their job and has a promising future in their current companies.
  • This rule is intersting in two ways. First, star workers is very likely to receive invitations from recuriters about new job opportunities and to jump to a new company. Second, some people like to jump to a new company for a big rise if they are well-recognized in their current company.

Rule 3: (DistanceFromHome_grp=walking_dist) --> (Attrition=No)

  • The group of employees live very close to their companies.
  • This rule is interesting beccuase in real life, we often underlook how living close to a company can help imporving quality of life.

Rule 4: (Department=Research & Development) --> (Attrition=No)

  • This group of employees are working in the Research & Development field.
  • This rule is interesting in a way that these well-educated and workers are very popular among companies and could have many great working opportunites. Thus, they have higher loyalty than expected.

Rule 5: (WorkLifeBalance=3) --> (Attrition=No)

  • This group of employees have acceptable level of worklifebalance.
  • This rule is intersting and siginificant becuase it tells us that companies don't have to always provide the best work-life balance for thier employees when they want to decrease attrition rate.

6.3 Those who leaves

Rule 1: (StockOptionLevel=0) --> (Attrition=Yes)

  • This group of employees doesn't have any stock.
  • Based on our common sense, employees with no stock option are generally new to companies and would work very hard to try to get promoted or rises in the future.

Rule 2: ('YearsAtCompany_grp=few_yrs_comp', 'YearsSinceLastPromotion_grp=rec_since_promo', 'YearsWithCurrManager_grp=rec_manager') --> (Attrition=Yes)

  • This group of employees has been in their company for a few years and have been promoted recently. They are recently assigned to their current manager.
  • This rule is intersting becuase it tells us that geting a promotion isn't enough to keep an employee staying in the company. Employees take thier manager into account when making decision of leaving.

Rule 3: ('YearsAtCompany_grp=few_yrs_comp', 'TotalWorkingYears_grp=few_wrok_yrs', 'YearsInCurrentRole_grp=few_yrs_role', 'YearsSinceLastPromotion_grp=rec_since_promo') --> (Attrition=Yes)

  • This group of employees have limited working expereinces and they seems to be doing their first job. Not long after promotion, they left the company.
  • This rule is significant becuase it warns companies that when promoting relatively new employees, companies should take into account employees' willingness to stay in the company.

Rule 4: (YearsAtCompany_grp=few_yrs_comp, YearsInCurrentRole_grp=few_yrs_role, PerformanceRating=3.0) --> (Attrition=Yes)

  • This group of employees seems to be doing the same role since they come to their companies. Their performance level is okay.
  • To effectviely decrease the attrition rate, company should look into how to retain these med-performance.

Rule 5: (YearsAtCompany_grp=few_yrs_comp, TotalWorkingYears_grp=few_wrok_yrs, PerformanceRating=3.0) --> (Attrition=Yes)

  • This group of employees seems to be having their first jobs in comapnies. They have med-level performance and they left the caompnies.
  • This rule is interesting becuase it tells us that in the first couple working years, employees wii decided if they are good at doing their jobs.